feat: add native support for get_json_object expression #3747
Draft
andygrove wants to merge 6 commits into apache:main from
Implement the Spark GetJsonObject expression natively using serde_json for JSON parsing and a custom JSONPath evaluator supporting field access, array indexing, bracket notation, and wildcards. Closes apache#3162
Mark as Incompatible since Spark's Jackson parser allows single-quoted JSON and unescaped control characters which serde_json does not support. Add allowIncompatible config to SQL test file.
- Enable serde_json preserve_order feature to maintain JSON key ordering
- Fix wildcard to only work on arrays (not objects), matching Spark
- Fix single wildcard match to preserve JSON string quoting
- Add user-facing docs in expressions.md
- Add more SQL tests: object wildcard, single match, missing fields, invalid paths, field names with special chars, key ordering
- Add Rust unit tests for new edge cases
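As a rough sketch of how these semantics compose — in Python for brevity, with hypothetical names (the actual implementation is Rust in get_json_object.rs) — a parsed path can be evaluated over a decoded JSON tree like this:

```python
import json

# Toy model of a parsed JSONPath: a str segment is a field access,
# an int segment is an array index, and WILDCARD stands for [*].
WILDCARD = object()

def eval_path(doc, segments):
    """Walk segments over a decoded JSON tree; None maps to SQL NULL.

    The real implementation serializes matches back to JSON text (which
    is where string quoting for a single wildcard match matters); this
    sketch just returns the decoded values.
    """
    current = [doc]
    for seg in segments:
        nxt = []
        for node in current:
            if seg is WILDCARD:
                # Wildcard applies only to arrays, matching Spark; on an
                # object it matches nothing.
                if isinstance(node, list):
                    nxt.extend(node)
            elif isinstance(seg, str) and isinstance(node, dict) and seg in node:
                nxt.append(node[seg])
            elif isinstance(seg, int) and isinstance(node, list) and 0 <= seg < len(node):
                nxt.append(node[seg])
        current = nxt
    if not current:
        return None
    return current[0] if len(current) == 1 else current

doc = json.loads('{"name": "a", "items": [{"id": 1}, {"id": 2}]}')
```

For example, `eval_path(doc, ["items", WILDCARD, "id"])` yields `[1, 2]`, while a wildcard applied directly to the root object yields `None`, mirroring the arrays-only wildcard fix above.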
Benchmarks simple field, numeric field, nested field, array element, and nested object extraction with 1M rows of JSON data.
- Move StringBuilder import to top-level imports
- Fix doc comment to not mention object wildcard (unsupported)
- Pre-compute has_wildcard in ParsedPath struct (avoids per-row scan)
- Split evaluation into evaluate_no_wildcard (returns Option, zero Vec allocations) and evaluate_with_wildcard (returns Vec)
- Simplify single wildcard match: serialize directly instead of array-wrap-then-strip-brackets hack
- Add comment in Cargo.toml explaining preserve_order requirement
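The preserve_order requirement mentioned above comes down to Cargo feature unification: enabling the feature in one crate enables it for every crate in the workspace that depends on serde_json. A sketch with a hypothetical crate layout:

```toml
# native/spark-expr/Cargo.toml
[dependencies]
# preserve_order backs serde_json's Map with an insertion-ordered map so
# object keys round-trip in input order, matching Spark's Jackson output.
serde_json = { version = "1", features = ["preserve_order"] }

# some-other-crate/Cargo.toml (hypothetical sibling crate in the workspace)
[dependencies]
serde_json = "1"  # also built with preserve_order via feature unification
```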
Which issue does this PR close?
Closes #3162.
Rationale for this change
`get_json_object` is a widely-used Spark function for extracting values from JSON strings using JSONPath expressions. Without native support, queries using this function fall back to Spark's JVM execution. This PR adds an initial native implementation to allow Comet to accelerate these queries.

This is a starting point. The expression is marked `Incompatible` and is disabled by default. Users must set `spark.comet.expression.GetJsonObject.allowIncompatible=true` to enable it.

What changes are included in this PR?
- Rust implementation (`native/spark-expr/src/string_funcs/get_json_object.rs`): supports `$` (root), `.field`, `['field']` (bracket notation), `[n]` (array index), and `[*]` (array wildcard); uses `serde_json` with the `preserve_order` feature for Spark-compatible key ordering
- Scala serde (`spark/src/main/scala/org/apache/comet/serde/strings.scala`): `CometGetJsonObject` with `getSupportLevel` returning `Incompatible` (Spark's Jackson parser allows single-quoted JSON and unescaped control characters that `serde_json` does not)
- Registration and wiring: added to the `stringExpressions` map in `QueryPlanSerde.scala` and registered in `comet_scalar_funcs.rs` via `scalarFunctionExprToProtoWithReturnType`
- SQL tests (`get_json_object.sql`): 30 test queries covering field extraction, nested objects, arrays, wildcards, nulls, invalid JSON, bracket notation, and edge cases
- Docs: updated `expressions.md` and `spark_expressions_support.md`

Current performance
Benchmarked with 1M rows of JSON (~200 bytes each) on Apple M3 Ultra:
Benchmarked query patterns: `$.name` (simple field), `$.age` (numeric field), `$.address.city` (nested field), `$.items[0]` (array element), and `$.address` (nested object).

Comet is currently ~10% slower than Spark. The primary reason is that `serde_json` parses the full JSON document into a DOM tree on every row, while Spark's Jackson-based implementation uses a streaming parser that can skip irrelevant fields without allocating.

Known limitations and future work
This is an initial implementation. Known gaps that could be addressed in follow-up PRs:
- Performance: replace `serde_json::from_str` (full DOM parse) with a streaming approach (e.g., `jiter` or a custom `serde_json::Deserializer` with `IgnoredAny`) to skip irrelevant JSON content without allocating. This would likely close the performance gap with Spark.
- `$.*` on arrays: Spark distinguishes `$.*` (object wildcard, using a `Wildcard` token) from `$[*]` (array wildcard, using `Subscript::Wildcard`). Our parser treats both as the same `Wildcard` segment. Currently `$.*` on arrays returns values in Comet but null in Spark.
- Nested array flattening: `$[*][*]` triggers `FlattenStyle`, which flattens nested arrays. Our implementation doesn't handle this special case.
- Wrapping behavior: for paths like `$.arr[0][*].field`, Spark's `WriteStyle` state machine may produce different wrapping behavior than our count-based approach.
- `preserve_order` is workspace-wide: Cargo unifies features, so enabling `preserve_order` on `serde_json` in `spark-expr` also enables it for all other crates in the workspace. Could be addressed by isolating the JSON parsing behind a feature flag.

How are these changes tested?
- SQL file tests (`CometSqlFileTestSuite`) that run each query through both Spark and Comet and compare results, with dictionary encoding on/off
- A benchmark (`CometGetJsonObjectBenchmark`) comparing Spark vs Comet performance across 5 query patterns
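One item in the future-work list above is replacing the full DOM parse with a streaming scan. As a rough illustration of why streaming can win — in Python for brevity, and as a toy sketch rather than anything in this PR — a scanner can find one top-level field in the raw text and stop early, instead of decoding the whole document into a tree the way `serde_json::from_str` (or `json.loads`) does. It assumes valid JSON and extracts only string or bare scalar values of top-level keys; a real implementation would use a streaming parser such as `jiter`.

```python
def scan_top_level_field(doc: str, key: str):
    """Return the raw value of a top-level field, scanning without
    building a DOM. Assumes valid JSON; strings and bare scalars only."""
    i, n, depth = 0, len(doc), 0
    while i < n:
        c = doc[i]
        if c == '"':
            # consume a string token, honoring backslash escapes
            j = i + 1
            while doc[j] != '"':
                j += 2 if doc[j] == '\\' else 1
            token = doc[i + 1:j]
            i = j + 1
            # a string at depth 1 followed by ':' is a top-level key
            k = i
            while k < n and doc[k] in ' \t\r\n':
                k += 1
            if depth == 1 and k < n and doc[k] == ':' and token == key:
                k += 1
                while doc[k] in ' \t\r\n':
                    k += 1
                if doc[k] == '"':  # string value
                    j = k + 1
                    while doc[j] != '"':
                        j += 2 if doc[j] == '\\' else 1
                    return doc[k + 1:j]
                j = k  # bare scalar: read until a delimiter
                while j < n and doc[j] not in ',}]':
                    j += 1
                return doc[k:j].strip()
        elif c in '{[':
            depth += 1
            i += 1
        elif c in '}]':
            depth -= 1
            i += 1
        else:
            i += 1
    return None
```

The scan returns as soon as the requested key's value is read, and nested objects are skipped by depth counting alone, with no per-node allocation — which is the property that makes Jackson-style streaming faster than a per-row DOM parse.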